Analyzing a 35-Year Hourly Data Record: Why So Difficult?
If Big Data is hard for data center systems, what about the average science user?

Big Data at the GES DISC

> The Goddard Earth Sciences Data and Information Services Center offers Earth science data to the science community

> GES DISC provides the Giovanni tool for exploratory data analysis 


http://giovanni.gsfc.nasa.gov/giovanni/

Giovanni provides Quick-Start Exploratory Data Analysis

> How much data can we process for a Giovanni visualization?

> We will then incorporate limits into the Giovanni User Interface to prevent workflow failures 

> North American Land Data Assimilation System 

> Land Hydrology model output 

> Soil moisture and temperature, runoff, evapotranspiration, ...

> Available in netCDF 

> community standard 

> random-access format 

> Archived as 1 file per timestep 

> ~5000 users in 2013


Spatial Resolution           0.125 x 0.125 deg
Grid Dimensions              464 longitude x 224 latitude
Grid Cells per Timestep      103,936
Files / Day                  24
Days / Yr                    365
Years                        >35
Total Timesteps              >310,000
Total Values per Variable    ~32,000,000,000 (32 billion)
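
The totals above can be checked with a few lines of shell arithmetic. This is only a sketch: it takes the grid dimensions from the table and the 311,135-file count quoted later in the slides. The 35-year figure assumes 365-day years, so it comes out slightly under the actual timestep count.

```shell
# Grid cells per timestep: longitude points x latitude points
cells=$((464 * 224))

# Hourly timesteps in 35 years of 365 days (leap days and the partial final year add more)
steps_35yr=$((24 * 365 * 35))

# Values for one variable, using the 311,135 files (1 per timestep) quoted in the slides
values=$((cells * 311135))

echo "cells=$cells steps_35yr=$steps_35yr values=$values"
```

The last number, about 32.3 billion, is consistent with the "32 billion" in the table.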


A Simple, but Popular, Task: Get the time series for a point 

NCO (netCDF Operators): fast tools for processing netCDF data files: http://nco.sourceforge.net/

Try #1: Wildcard on command line

go> ncrcat -d lat,40.02,40.02 -d lon,-105.29,-105.29 NLDAS_NOAH0125_H_002_soilm0_100cm/*/*.nc out.nc

-bash: /tools/share/COTS/nco/bin/ncrcat: Argument list too long

311,135 files, 1 per timestep
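
"Argument list too long" is the shell reporting that the expanded wildcard exceeded the operating system's limit on argument size. A rough way to see how far over the limit a list of this size lands is sketched below. The 60-character average path length is a hypothetical figure chosen for illustration, not a value from the slides.

```shell
# Kernel limit, in bytes, on the combined size of command-line arguments and environment
arg_max=$(getconf ARG_MAX)

# Hypothetical average of 60 characters per expanded path, plus 1 separator byte each
n_files=311135
est_bytes=$((n_files * 61))

echo "ARG_MAX=$arg_max estimated_argument_bytes=$est_bytes"
if [ "$est_bytes" -gt "$arg_max" ]; then
    echo "expanded wildcard would exceed the limit"
fi
```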


Try #2: Read file list from standard input (NCO feature for working with lots of files) 

go> cat in.ncrcat | ncrcat -d lat,40.02,40.02 -d lon,-105.29,-105.29 -o out.nc

ncrcat: ERROR Total length of fl_lst_in from stdin exceeds 10000000 characters. Possible misuse of feature.
If your input file list is really this long, post request to developer's forum
(http://sf.net/projects/nco/forums/forum/9831) to expand FL_LST_IN_MAX_LNG*

*Later expanded after post to developer's forum.

311,135 files => 21,796,250 chars


Try #3: Create local, short symlinks, then read from stdin 

ln/1979/n1979010201.nc ->
    /var/scratch/clynnes/agu2014/data/1979/scrubbed.NLDAS_NOAH0125_H_002_soilm0_100cm.19790102T0100.nc

SUCCESS! 

311,135 files => 7,165,105 chars (whew!)

[Plot: Boulder, CO, Soil Moisture 0-100 cm]
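
The symlink trick can be tried on a small scale. The sketch below is not the authors' script. It builds three empty stand-in files in a temporary directory, using the file-name pattern from the slide, then links them under short names of the form ln/<YYYY>/n<YYYYMMDDHH>.nc and compares the sizes of the two file lists. The list names in.long and in.symlink and the scratch layout are assumptions made for the example.

```shell
# Scratch area with three stand-in hourly files (names follow the pattern on the slide)
work=$(mktemp -d)
mkdir -p "$work/data/1979" "$work/ln/1979"
for hh in 00 01 02; do
    touch "$work/data/1979/scrubbed.NLDAS_NOAH0125_H_002_soilm0_100cm.19790102T${hh}00.nc"
done
cd "$work"

# Long list: absolute paths to the original files
find "$work/data" -name '*.nc' | sort > in.long

# Short list: relative symlinks named n<YYYYMMDDHH>.nc under ln/<YYYY>/
: > in.symlink
while read -r f; do
    stamp=${f%.nc}; stamp=${stamp##*.}      # e.g. 19790102T0100
    ymd=${stamp%%T*}                         # e.g. 19790102
    hh=${stamp#*T}; hh=${hh%00}              # e.g. 01
    link="ln/${ymd%????}/n${ymd}${hh}.nc"    # e.g. ln/1979/n1979010201.nc
    ln -s "$f" "$link"
    echo "$link" >> in.symlink
done < in.long

# Character counts of the two lists (what ncrcat would read from stdin)
long_chars=$(($(wc -c < in.long)))
short_chars=$(($(wc -c < in.symlink)))
echo "long list: $long_chars chars, short list: $short_chars chars"
```

Each short entry is 22 characters plus a newline, so the full list lands near the 7.2 million characters reported on the slide.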

Computing an Area Average Time Series (over Colorado)


Pre-concatenation of data can be helpful*

> Use ncrcat to concatenate hourly into daily or monthly files

Also, Divide and Conquer

> Split input file list into sections, process, and stitch together to reduce memory and temp disk usage


Time Length    Avg File    Elapsed Time (min)
of File        Size        Processing    Aggregation
1 hour         0.4 MB      805           0
1 day          9.1 MB      61            75
1 month        285 MB      42            74

> But it makes some analysis (e.g. diurnal) unwieldy 

*Thanks to R. Strub and W. Teng for the idea 
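
Adding the two elapsed-time columns shows why pre-concatenation pays off even on the first run. The aggregation cost is paid once, and later analyses reuse the aggregated files. The sketch below uses only the numbers from the table above.

```shell
# Elapsed minutes from the table: processing + one-time aggregation
hourly=$((805 + 0))
daily=$((61 + 75))
monthly=$((42 + 74))

echo "hourly=$hourly daily=$daily monthly=$monthly"
```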


Parallel Processing Made Slightly Less Difficult

> Split input file list into sections, process in parallel, then stitch together

> GNU make: parallel job execution and control engine (-j option) for the common folk 
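
The splitting step can be tried on its own before involving make. The sketch below uses a stand-in list of 311,135 numbered lines in place of the real file list and the chunk size of 15556 that appears in the Makefile example in these slides. It assumes GNU split, since the -d option for numeric suffixes is not available in every implementation.

```shell
work=$(mktemp -d) && cd "$work"

# Stand-in for in.symlink: one line per hourly file
seq 1 311135 > in.symlink

# Chunks of 15556 lines with numeric suffixes: in.00, in.01, ...
split -d -l 15556 in.symlink in.

# Count the chunks and inspect the size of the last one
nchunks=$(($(ls in.?? | wc -l)))
last_lines=$(($(wc -l < in.20)))
echo "chunks=$nchunks lines_in_last_chunk=$last_lines"
```

With this chunk size the list divides into 20 full chunks plus a short 21st, so the final chunk costs almost nothing to process.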


# Makefile example: extract a time series* over Boulder, CO from hourly files

# 1. Split list of input files (in.symlink) into chunks: in.00, in.01, ...
SPLITFILES = $(shell split -d -l 15556 in.symlink in. && ls -1 in.??)
OUTSPLIT = $(SPLITFILES:in.%=chunk.%.nc)

BBOX_ARGS = -d lat,40.02,40.02 -d lon,-105.29,-105.29

all: split boulder.nc

split:
	@echo $(SPLITFILES)

# 2. Extract and concatenate each chunk (ncrcat reads the file list from stdin)
chunk.%.nc: in.%
	sort $^ | ncrcat -O -o $@ $(BBOX_ARGS)

# 3. When all sections are done, stitch them together
boulder.nc: $(OUTSPLIT)
	ncrcat -O -o $@ $^

*The actual Makefile for Area Average is more complicated than this point time series example.

> To run 2 jobs in parallel:
make split &&
make -j 2 -l 9. all

Colorado Area Average Time Series (Monthly Files)

Parallel    Elapsed Time (min.)
Jobs        netCDF-3    netCDF-4
1           42          26
2           42          21
3           36          20

Parallelization is not quite a silver bullet
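
Turning the elapsed times above into ratios makes the point concrete. The sketch below computes the speedup from 1 to 3 parallel jobs for each format, and the ratio between formats for a single job. It uses only the table values and awk for the floating-point division.

```shell
# Speedup going from 1 to 3 parallel jobs, per format
nc3_speedup=$(awk 'BEGIN { printf "%.2f", 42 / 36 }')
nc4_speedup=$(awk 'BEGIN { printf "%.2f", 26 / 20 }')

# Ratio of netCDF-3 to netCDF-4 elapsed time with a single job
format_ratio=$(awk 'BEGIN { printf "%.2f", 42 / 26 }')

echo "netCDF-3 speedup=$nc3_speedup netCDF-4 speedup=$nc4_speedup format ratio=$format_ratio"
```

Three jobs give well under a 3x speedup in either format. In this table the choice of format changes elapsed time more than the job count does.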

> To Data Product Designers / Producers

> Include the Time Dimension, even for single-time-step files.

> It makes reorganizing time segments easy (with NCO’s ncrcat) 

> To Data Centers 

> Provide more remote analysis capability for scientists. 

> GrADS Data Server: http://www.iges.org/grads/gds/ 

> Live Access Server: http://ferret.pmel.noaa.gov/LAS 

> Giovanni: Federated Giovanni in progress (Talk IN52A-05, MW-2020, Friday, 11:20)

> Provide users with tips and tricks for Big Data processing.

> No reason users need to re-learn our lessons the hard way.

> Provide or point to tools and recipes for splitting and stitching data. 








